connection: clean up failed heartbeat sends #876
dkropachev wants to merge 1 commit into scylladb:master
892831d to 358e016
Pull request overview
This PR fixes a heartbeat failure edge case where send_msg() can fail after a request id has been reserved, leaking the request id / callback registration and (for control connections) potentially leaving in_flight incorrectly incremented. It also adds a regression test to ensure request-id and in-flight bookkeeping is restored after a failed heartbeat send.
Changes:
- Update HeartbeatFuture to unwind the _requests registration and return the reserved request id when send_msg() fails.
- Ensure in_flight is released directly only for control connections (since the control-connection owner’s return_connection() doesn’t decrement in_flight).
- Add a unit test covering failed heartbeat send cleanup.
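The cleanup described in the list above can be sketched with a stand-in connection. This is a minimal sketch under stated assumptions, not the driver's code: `FakeConnection`, its `fail_push` flag, and `send_heartbeat` are hypothetical names, and only `request_ids`, `_requests`, `in_flight`, `is_control_connection`, `get_request_id()`, and `send_msg()` mirror identifiers mentioned in this PR.

```python
from collections import deque


class FakeConnection:
    """Hypothetical stand-in for the driver's connection (not the real class)."""

    def __init__(self, is_control_connection=False, fail_push=False):
        self.request_ids = deque(range(4))  # pool of free request ids
        self._requests = {}                 # request_id -> callback
        self.in_flight = 0
        self.is_control_connection = is_control_connection
        self.fail_push = fail_push          # hypothetical switch to simulate a write error

    def get_request_id(self):
        return self.request_ids.popleft()

    def send_msg(self, msg, request_id, cb):
        # Register the callback first, then write; the write may raise.
        self._requests[request_id] = cb
        if self.fail_push:
            raise OSError("simulated socket write failure")


def send_heartbeat(connection):
    """Send one heartbeat; on failure, unwind what was reserved and return the error."""
    connection.in_flight += 1
    request_id = connection.get_request_id()
    try:
        connection.send_msg("OPTIONS", request_id, lambda response: None)
    except Exception as exc:
        if connection.is_control_connection:
            # Assumption from the PR discussion: the control-connection owner
            # will not release this slot later, so it is released here.
            connection.in_flight -= 1
        connection._requests.pop(request_id, None)
        if request_id not in connection.request_ids:
            connection.request_ids.append(request_id)
        return exc
    return None


conn = FakeConnection(is_control_connection=True, fail_push=True)
error = send_heartbeat(conn)
```

After the failed send, the stand-in has no registered callback, all four ids are back in the pool, and `in_flight` is zero again.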
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| cassandra/connection.py | Adds failure-path cleanup in HeartbeatFuture for send_msg() exceptions (request id + _requests unwind; in_flight handling for control connections). |
| tests/unit/test_connection.py | Adds regression test ensuring request id pool, _requests, and in_flight are consistent after a heartbeat send failure. |
    request_id = connection.get_request_id()
    try:
        connection.send_msg(OptionsMessage(), request_id, self._options_callback)
    except Exception as exc:
        connection.in_flight -= 1
        if request_id not in connection._requests and request_id not in connection.request_ids:
            connection.request_ids.append(request_id)
        self._exception = exc
        self._event.set()
else:
Comment for the first commit. My assumption in this comment: if self.push(msg) fails, then the connection is broken.
There is one more edge case: if we fail after self._requests[request_id] = (cb, decoder, result_metadata) but before self.push(msg), then _requests will contain the new request even though it was never sent, effectively orphaning it without it being accounted for as such.
Can we fix that? It would require having a cleanup path for request_id in connection._requests, which I'm afraid is a bit too dangerous.
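The edge case can be reproduced in isolation. The sketch below is an illustration with hypothetical stand-ins (module-level `request_ids` and `_requests`, and a made-up `send_msg_failing_after_registration`), not the driver's code. It applies the guard from the first snippet and shows that, once the callback has been registered, the guard returns nothing to the pool.

```python
from collections import deque

request_ids = deque([1, 2, 3])  # free pool; id 0 has already been handed out
_requests = {}                  # request_id -> callback
request_id = 0


def send_msg_failing_after_registration(request_id, cb):
    # Hypothetical stand-in: register the callback, then fail before the write.
    _requests[request_id] = cb
    raise OSError("simulated failure before push")


try:
    send_msg_failing_after_registration(request_id, print)
except OSError:
    # Guard from the first snippet: return the id only if it is not registered.
    if request_id not in _requests and request_id not in request_ids:
        request_ids.append(request_id)

orphaned = request_id in _requests       # callback registered, message never sent
id_returned = request_id in request_ids  # whether the id went back to the pool
```

In this model the entry stays in `_requests` and the id stays out of the pool, which is the orphaned state the comment describes.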
if connection.is_control_connection:
    connection.in_flight -= 1
# send_msg() registers the callback before writing to the socket,
# so a write failure must unwind that registration here.
connection._requests.pop(request_id, None)
if request_id not in connection.request_ids:
Sorry, I just don't understand the second commit. Why do you treat CC differently here? Why do we care about socket write error - won't it result in connection being closed anyway?
CC is special because ControlConnection.return_connection() does not decrement in_flight, while HostConnection.return_connection() does. The write failure still matters because send_msg() has already registered the callback and reserved the request id before push(), so that bookkeeping has to be unwound explicitly.
You mention return_connection but I don't see where it is called :(
I meant the later owner.return_connection(connection) in ConnectionHeartbeat.run:
python-driver/cassandra/connection.py, lines 1874 to 1927 in 6ad10b4
python-driver/cassandra/connection.py, lines 1921 to 1927 in 6ad10b4
That block only unwinds the callback/request-id registration that send_msg() already did:
python-driver/cassandra/connection.py, lines 1205 to 1231 in 6ad10b4
For control connections, ControlConnection.return_connection() does not decrement in_flight
python-driver/cassandra/cluster.py, lines 4267 to 4269 in 6ad10b4
while HostConnection.return_connection() does, so the direct decrement has to stay here. It can’t be handled in ControlConnection.return_connection() because that method only sees a defunct/closed connection at the end of the heartbeat cycle
python-driver/cassandra/pool.py, lines 542 to 547 in 6ad10b4
It does not know which request_id was reserved, and it does not have the context that send_msg() already registered the callback in _requests. The leak happens in the send_msg() failure path, while we still have the request_id and can unwind _requests / request_ids immediately.
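The asymmetry between the two owners can be modelled as follows. This is a sketch based only on the behaviour described in this thread: `HostPoolOwner`, `ControlOwner`, `FakeConnection`, and `failed_heartbeat_cycle` are hypothetical stand-ins, and the `decrement_always` flag exists only to show what an unconditional decrement would do.

```python
class FakeConnection:
    """Hypothetical stand-in carrying only the fields this example needs."""

    def __init__(self, is_control_connection):
        self.is_control_connection = is_control_connection
        self.in_flight = 0


class HostPoolOwner:
    """Stand-in owner that releases the in-flight slot on return, as described for HostConnection."""

    def return_connection(self, connection):
        connection.in_flight -= 1


class ControlOwner:
    """Stand-in owner that leaves in_flight untouched, as described for ControlConnection."""

    def return_connection(self, connection):
        pass


def failed_heartbeat_cycle(owner, connection, decrement_always=False):
    """Model one heartbeat cycle whose send fails; return the final in_flight."""
    connection.in_flight += 1  # slot reserved before the send
    # ... the send fails here ...
    if decrement_always or connection.is_control_connection:
        connection.in_flight -= 1
    owner.return_connection(connection)  # end of the heartbeat cycle
    return connection.in_flight


host_result = failed_heartbeat_cycle(HostPoolOwner(), FakeConnection(False))
control_result = failed_heartbeat_cycle(ControlOwner(), FakeConnection(True))
double_release = failed_heartbeat_cycle(
    HostPoolOwner(), FakeConnection(False), decrement_always=True
)
```

With the conditional decrement both owners end at zero; decrementing unconditionally would release the host connection's slot twice in this model.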
Keep heartbeat request-id and in-flight bookkeeping consistent when send_msg() fails.

Handle the control-connection in_flight release separately from HostConnection cleanup.
358e016 to 6ad10b4
is this the right way to fix it?
Can't agree more, there has to be an infrastructure that handles request scheduling, execution, and failure with respect to borrowing/returning request_id, tracking in_flight, etc.
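One possible shape for such infrastructure is a context manager that owns both the request id and the in-flight slot. The sketch below is purely illustrative: `SlotTracker` and `reserve()` are hypothetical names that do not exist in the driver, and a real design would also need locking and a release path for completed responses.

```python
from collections import deque
from contextlib import contextmanager


class SlotTracker:
    """Hypothetical bookkeeping helper; not part of the driver."""

    def __init__(self, size):
        self.request_ids = deque(range(size))  # pool of free request ids
        self.in_flight = 0

    @contextmanager
    def reserve(self):
        """Reserve a request id and an in-flight slot; roll both back if the body raises."""
        request_id = self.request_ids.popleft()
        self.in_flight += 1
        try:
            yield request_id
        except Exception:
            self.in_flight -= 1
            self.request_ids.append(request_id)
            raise


tracker = SlotTracker(size=2)

try:
    with tracker.reserve() as request_id:
        raise OSError("simulated send failure")
except OSError:
    pass  # the reservation was rolled back by reserve()

with tracker.reserve() as sent_id:
    pass  # send succeeded; the slot stays reserved until a response arrives
```

Keeping the reserve and rollback steps in one place means a caller cannot forget half of the cleanup, which is the kind of guarantee the comment asks for.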
Fixes #875.
Heartbeat sends can fail after a request id has already been reserved. This change keeps the request-id pool and in-flight accounting consistent across that failure path, and avoids double-releasing the slot on the control-connection branch.
Changes:
send_msg() fails